-
Notifications
You must be signed in to change notification settings - Fork 5.3k
[Android CoreCLR] Log managed callstacks on native crash #123824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[Android CoreCLR] Log managed callstacks on native crash #123824
Conversation
|
Tagging subscribers to this area: @agocke |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR enables Android CoreCLR to log managed stack traces when a native crash occurs, providing diagnostic information in environments where CreateDump is not available. The logging is wired through the existing crash-dump hook so that, whenever a dump would be created, the crashing managed thread’s stack is emitted to Android’s logging.
Changes:
- Add a HOST_ANDROID helper in
EEPolicythat walks and logs the current managed thread’s call stack using existingLogCallstackForLogWorkerinfrastructure. - Extend
PROCCreateCrashDumpIfEnabledon Android to call this helper via a weak symbol and emit a formatted “Managed Stacktrace” section tominipal_log_write_fatal, followed by the existing abort message.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/coreclr/vm/eepolicy.cpp |
Introduces LogCallstackForAndroidNativeCrash, an extern "C" Android-only helper that obtains the current Thread and logs its managed call stack, to be invoked from the PAL crash-dump path. |
src/coreclr/pal/src/thread/process.cpp |
On Android, adds a weak reference to LogCallstackForAndroidNativeCrash and, when present, logs a managed stacktrace banner and the managed stack in PROCCreateCrashDumpIfEnabled before logging that the process is aborting. |
| minipal_log_write_fatal("\n=================================================================\n"); | ||
| minipal_log_write_fatal("\tManaged Stacktrace:\n"); | ||
| minipal_log_write_fatal("=================================================================\n"); | ||
| LogCallstackForAndroidNativeCrash(); | ||
| minipal_log_write_fatal("=================================================================\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lateralusX I briefly looked into emitting the callstacks in the same manner that CrashInfo.cs does it. It seemed like its format was designed for managed exceptions, and I don't know if it'd be useful to stub in fields like the type, hr, message, etc. until we know what tools reading the adb log would expect.
If we did want to format it a particular way, it looed liked we'd be adding another API to CallStackLogger, for what seems like a temporary solution until Android CoreCLR can create a dump.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You do not have to stub anything in the format produced by CrashInfo.cs.
We have intentionally used json for this format. json makes it straightforward to omit fields that are not relevant or available in the given situation, and also add additional information specific for the given situation. We have number of optional fields like that, for example
runtime/src/coreclr/nativeaot/System.Private.CoreLib/src/System/CrashInfo.cs
Lines 249 to 253 in 3df8fb2
| if (moduleBase != nint.Zero) | |
| { | |
| if (!WriteHexValue("module"u8, (nuint)moduleBase)) | |
| return false; | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to, or we dont need to adopt the same format?
We were considering using the same format since NativeAOT uses it in some scenarios. If we end up generating a dump on Android CoreCLR, it would be nice if it could automatically work with !crashinfo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to, or we dont need to adopt the same format?
Sorry, I meant to "We do not nave to stub..."
We were considering using the same format since NativeAOT uses it in some scenarios.
Yes, it would be great to have on json format for crash reports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to store this json file in some directory that the apps have write access to by default? Is there a directory like that on Android?
I do not think we want to be printing it to Android console (by default) since it tends to be many pages of text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should generate a crash report in the same json format as used by NAOT and crashdumps --crashreport and --crashreportonly. It needs to run in-proc, so the data we put into the crash report needs to be retrieved on best effort and implementation should be aware that it can run from a signal handler. It should follow the same env as we do on other platforms (but only supporting the arguments making sense for crash reports).
Question is if shouldn't be on by default? Reason why it shouldn't be on by default is that the app needs additional service to get the file of device and if nothing like that is used, we could fill up the devices with crash reports and app users needs to explicitly clean them up on an app-by-app basis or by the app on relaunch. If an app has additional infrastructure/services to upload managed crash reports, then they would enable crash reports support.
Each app can have private files, it can have a private cache folder (nuked by the system), it can use external storage as well as using shared storage with other apps. An app can't write into the crash folders used by the system crash daemon.
I think that an app should opt-in to managed crash reporting, give runtime the path to the location of the crash report + additional configuration through existing env crash dump env variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could fill up the devices with crash reports and app users needs to explicitly clean them up on an app-by-app basis or by the app on relaunch.
If is on by default, we need to make sure to store the last crash reports only (or only a few crash reports). Overwrite or delete older crash reports as new ones are created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, if we decide to enable full crash report by default on Android we need to constrain amount of crash reports, since it will be in text format we could also consider storing it in compressed format to reduce disk footprint even more.
3d0912d to
ffb92cc
Compare
| // TODO: Dump all managed threads callstacks into logcat and/or file? | ||
| if (LogCallstackForAndroidNativeCrash != nullptr) | ||
| { | ||
| minipal_log_write_fatal("\n=================================================================\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the ======================== separators? We do not use a separators like that anywhere else in CoreCLR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No they're not needed, I had initially just used the format that Android Mono was using
runtime/src/mono/mono/mini/mini-exceptions.c
Lines 3005 to 3011 in 2772854
| g_async_safe_printf ("\n=================================================================\n"); | |
| g_async_safe_printf ("\tManaged Stacktrace:\n"); | |
| g_async_safe_printf ("=================================================================\n"); | |
| mono_walk_stack_full (print_stack_frame_signal_safe, mctx, jit_tls, mono_get_lmf (), MONO_UNWIND_LOOKUP_IL_OFFSET | MONO_UNWIND_SIGNAL_SAFE, NULL); | |
| g_async_safe_printf ("=================================================================\n"); |
|
What is the experience for unhandled managed exceptions or fail fasts with this fix? We should make sure that the stacktrace is logged just once. |
What was the source code or runtime change for this example? |
| if (LogCallstackForAndroidNativeCrash != nullptr) | ||
| { | ||
| minipal_log_write_fatal("\n=================================================================\n"); | ||
| minipal_log_write_fatal("\tManaged Stacktrace:\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should something more self-explanatory than just "Managed stacktrace:". Maybe something like "Crash in native code".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought PROCCreateCrashDumpIfEnabled also gets hit for unhandled managed exceptions. Would it still be fine to classify that as crash in native code? Maybe also Crash report: would align more given the hope is this would be replaced by real create dump logic.
|
Just curious - would we also consider printing out full mixed callstack (native/managed)? The OS stackwalker will only be able to walk the first native section of the stack, once it hits managed code it will stop. So if there's native code above the managed code which would be interesting, we will have no way to learn about it. We could possibly print out just native addresses for the native portion, and rely on symbolication later. |
I'm modifying the Android.Device_Emulator.JIT.Test Functional test as that was the easiest way to apply runtime changes.
That particular log in the description is the RaiseFailFastException in ep_buffer_write_event. This log is from unhandled managed exception (haven't yet changed the format yet from discussions in above threads) |
|
Here are my thoughts on this:
|
This would be my next suggestion. extend our current stack walker to handler mixed native/managed frames and output symbolicated managed frames, but native frames will probably follow the same format as we get in the tombstone, address + (module + build id) in text. I think this will be beneficial, not just for Android, and it will make it simpler than combining native crash data from tombstone with the managed stack trace in logcat. If we include native frames in our internal stack walker, we also have some more capabilities to handle scenarios that the native stack-walker would have issues handle. On Android we could probably go one additional step on higher API level and use Androids libunwind and cover both native + java frames when stackwalking IP's falling outside dotnet managed code. Do we all agree it would make sense to extend our runtime stack-walker to have an option to produce mixed stack traces, only used for scenarios like this so shouldn't impact other stack walking usage, unless enabled. |
Agree, we should only log the stacktrace for the faulting thread once, since it will go into logcat that is a limited resource. Question is if we should do it on the call site instead of inside PROCCreateCrashDumpIfEnabled and focus that function to produce the json crash report on Android. In the scenario of unhandled managed exception and fail fast the managed stack trace is logged on call site before triggering call ending up in PROCCreateCrashDumpIfEnabled. For native crashes, that would mean logging this information as part of the signal handler, alternative is to change the logic for existing unmanaged exception and fail fast and make sure all scenarios (including signal handler) end up calling through the same function for logging the managed stack and then potentially create the crash dump when enabled. Feels like doing this change will make things a little cleaner and we centralize the logging in one function. |
I think we should start with the managed stacktrace when hitting unhandled HW exceptions + tombstone and it should give you enough information and available in logcat or crash reports submitted to the play console. Next step is probably to do a mixed stacktraces since we will get all information in one place and handle more scenarios than what the native stack walker can handle, will stop on first managed frame. On Mono we do try to dump the native stacktrace followed by a managed stacktrace, but the native stacktrace has been disabled on Android, so Mono currently only dumps the managed stacktrace, then the Android crash daemon will generate the native crash report. Maybe we could put a marker frame on top top frame of the managed stacktrace in case of crash in native code? Then we could do the same for transition in/out of runtime code through the stackwalk and when we have a mixed mode stackwalker, we can just replace those parts with the real native stack frames, something like this: |
Android CoreCLR currently does not ship with CreateDump. When native crashes occur, it is difficult to figure out the underlying cause of the crash, as managed stacks aren't captured anywhere, and android's crash reporter doesn't understand runtime symbols when it generates a tombstone.
Until we have a solution for creating a dump on Android CoreCLR, this PR looks to log managed callstacks for the crashing thread whenever a dump would have been created. That way, there is some hint as to what may have caused the crash, despite not having the native callstack.
These managed callstacks will only be emitted during FailFast. Synchronous faults currently will not hit PROCCreateCrashDumpIfEnabled as Android registers default signal handlers that the runtime defer signal handling to. See #123735
Example ADB log